Abstract

We study the problem of compressing massive tables within the partition-training paradigm introduced
by Buchsbaum et al. [SODA '00], in which a table is partitioned by an off-line training procedure
into disjoint intervals of columns, each of which is compressed separately by a standard, on-line
compressor like gzip. We provide a new theory that unifies previous experimental observations on
partitioning and heuristic observations on column permutation, all of which are used to improve
compression rates. Based on the theory, we devise the first on-line training algorithms for table
compression, which can be applied to individual files, not just continuously operating sources; and
also a new, off-line training algorithm, based on a link to the asymmetric traveling salesman problem,
which improves on prior work by rearranging columns prior to partitioning. We demonstrate these
results experimentally. On various test files, the on-line algorithms provide 35-55% improvement over
gzip with negligible slowdown; the off-line reordering provides up to 20% further improvement over
partitioning alone. We also show that a variation of the table compression problem is MAX-SNP hard.

1 Introduction

1.1 Table Compression

Table compression was introduced by Buchsbaum et al. [Q] as a unique application of compression, based
on several distinguishing characteristics. Tables are collections of fixed-length records and can grow to
be terabytes in size. They are often generated by continuously operating sources and can contain much
redundancy. An example is a data warehouse at AT&T that each month stores one billion records pertaining
to voice phone activity. Each record is several hundred bytes long and contains information about endpoint
exchanges, times and durations of calls, tariffs, etc.

The goals of table compression are to be fast, on-line, and effective: eventual compression ratios of 
100:1 or better are desirable. While storage reduction is an obvious benefit, perhaps more important is the 
reduction in subsequent network bandwidth required for transmission. Tables of transaction activity, like 
phone calls and credit card usage, are typically stored once but then shipped repeatedly to different parts of 
an organization: for fraud detection, billing, operations support, etc. 

Prior work distinguishes tables from general databases. Tables are written once and read many times, 
while databases are subject to dynamic updates. Fields in table records are fixed length, and records tend 
to be homogeneous; database records often contain intermixed fixed- and variable-length fields. Finally, 
the goals of compression differ. Database compression stresses index preservation, the ability to retrieve an
arbitrary record, under compression. Tables are typically not indexed at the level of individual records;
rather, they are scanned in toto by downstream applications. 

Consider each record in a table to be a row in a matrix. A naive method of table compression is to 
compress the string derived from scanning the table in row-major order. Buchsbaum et al. [Q] observe 
experimentally that partitioning the table into contiguous intervals of columns and compressing each interval 
separately in this fashion can achieve significant compression improvement. The partition is generated by 
a one-time, off-line training procedure, and the resulting compression strategy is applied on-line to the 
table. In their application, tables are generated continuously, so off-line training time can be ignored. They 
also observe heuristically that certain rearrangements of the columns prior to partitioning further improve 
compression, by grouping dependent columns more closely. 

We generalize the partitioning approach into a unified theory that explains both contiguous partition- 
ing and column rearrangement. The theory applies to a set of variables with a given, abstract notion of 
combination and cost; table compression is a concrete case. To test the theory, we design new algorithms 
for contiguous partitioning, which speed training to work on-line on single files in addition to continuously 
generated tables; and for reordering in the off-line training paradigm, which improves the compression rates 
achieved from contiguous partitioning alone. Experimental results support these conclusions. Before sum- 
marizing the results, we motivate the theoretical insights by considering the relationship between entropy 
and compression. 



1.2 Compressive Estimates of Entropy 

Let $C$ be a compression algorithm and $C(x)$ its output on a string $x$. A large body of work in information
theory establishes the existence of many optimal compression algorithms: i.e., algorithms such that
$|C(x)|/|x|$, the compression rate, approaches the entropy of the information source emitting $x$. These
results are usually established via limit theorems, under some statistical assumptions about the information
source. For instance, the LZ77 algorithm [22] is optimal for certain classes of sources, e.g., stationary and
ergodic.

While entropy establishes a lower bound on compression rates, it is not straightforward to measure 
entropy itself. One empirical method inverts the relationship and estimates entropy by applying a provably 
good compressor to a sufficiently long, representative string. That is, the compression rate becomes a 
compressive estimate of entropy. These estimates themselves become benchmarks against which future 
compressors are measured. Another estimate is the empirical entropy of a string, which is based on the 
probability distribution of substrings of various lengths, without any statistical assumptions regarding the 
source emitting the string. Kosaraju and Manzini [ p3| ] exploit the synergy between empirical entropy and 
true entropy. 

The contiguous partitioning approach to table compression [Q] exemplifies the practical exploitation of
compressive estimates. Each column of the table can be seen as being generated by a separate source. The 
contiguous partitioning scheme measures the benefit of a particular partition empirically, by compressing 
the table with respect to that partition and using the output size as a cost. Thus, the partitioning method uses 
a compressive estimate of the joint entropy among columns. Prior work [Q] demonstrates the benefit of this 
approach. 



1.3 Method and Results 

We are thus motivated to study table compression in terms of compressive estimates of the joint entropy of
random variables. In Section 2, we formalize and study two problems on partitioning sets of variables with
abstract notions of combination and cost; joint entropy forms one example. This generalizes the approach
of Buchsbaum et al. [Q], who consider the contiguous case only and when applied to table compression.
We develop idealized algorithms to solve these problems in the general setting. In Section 3, we apply
these methods to table compression and derive two new algorithms for contiguous partitioning and one new
algorithm for general partitioning with reordering of columns. The reordering algorithm demonstrates a link
between general partitioning and the classical asymmetric traveling salesman problem. We assess algorithm
performance experimentally in Section 4.

The new contiguous partitioning algorithms are meant to be fast; better in terms of compression than
off-the-shelf compressors like gzip (LZ77); but not as good as the optimal, contiguous partitioning algorithm.
The increased training speed (compared to optimal, contiguous partitioning) makes the new algorithms 
usable in ad hoc settings, however, when training time must be factored into the overall time to compress. 
We therefore compare compression rates and speeds to those of gzip and optimal, contiguous partitioning. 
For files from various sources, we achieve 35-55% improvement in compression with less than a 1.7-factor 
slowdown, both compared to gzip. For files from genetic databases, which tend to be harder to compress, 
the compression improvement is 5-20%, with slowdown factors of 3-8. 

The performance of the general partitioning with reordering algorithm is predicated on a theorized cor- 
relation between two measures of particular tours in graphs induced by the compression instances. We 
therefore measure this correlation, and the results suggest that the algorithm is nearly optimal (among par- 
titioning algorithms). For several of our files, the algorithm yields compression improvements of at least 
5% compared to optimal, contiguous partitioning without reordering, which itself improves over gzip by 
20-50% for our files. In some cases, the additional improvement approaches 20%. While training time can 
be ignored in the off-line training paradigm, we show the additional time for reordering is not significant. 

Finally, in Sections 5-7, we give some complexity results that link table compression to the classical
shortest common superstring problem. We show that an orthogonal (column-major) variation of table com- 
pression is MAX-SNP hard when LZ77 is the underlying compressor. On the other hand, while we also 
show that the row-major problem is MAX-SNP hard when run length encoding (RLE) is the underlying 
compressor, we prove that the column-major variation for RLE is solvable in polynomial time. We conclude 
with open problems and directions in Section 8.

2 Partitions of Variables with Entropy-Like Functions 

Let $X = \{x_1, \ldots, x_n\}$ be a set of discrete variables, each drawn from some domain $\mathcal{D}$, and
consider some function $H : \mathcal{D}^* \to \mathbb{R}$. We use $H(X, Y)$ as a shorthand for $H(Z)$, where $Z$
is the set composed of all the elements in $X$ and $Y$: if $X$ and $Y$ are sets, then $Z = X \cup Y$; if $X$ and
$Y$ are variables, then $Z = \{X, Y\}$; etc. For some partition $\mathcal{P}$ of $X$ into subsets, define
$\mathcal{H}(\mathcal{P}) = \sum_{Y \in \mathcal{P}} H(Y)$. We are interested in the relationship between $H(X)$
and $\mathcal{H}(\mathcal{P})$. For example, let $X$ be a vector of random variables with joint probability
distribution $p(X)$. Two vectors $X$ and $Y$ are statistically independent if and only if $p(x, y) = p(x)p(y)$,
for all $x, y$; otherwise, $X$ and $Y$ are statistically dependent. Let

$$H(X) = -\sum_{\{x_1, \ldots, x_n\}} p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n)$$

be the joint entropy of $X$. Then it is well known that for any partition $\mathcal{P}$ of $X$,
$H(X) \le \mathcal{H}(\mathcal{P})$, with equality if and only if all the subsets in $\mathcal{P}$ are
mutually independent.
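
The inequality between joint entropy and the cost of a partition can be checked numerically on a toy case. The following is a minimal sketch only: the three-variable distribution and the helper names `entropy` and `marginal` are assumptions made for illustration and do not come from the experiments in this paper.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, indices):
    """Marginal distribution of the variables at positions `indices`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in indices)] += p
    return dict(out)

# Hypothetical joint distribution over (x1, x2, x3):
# x2 is a copy of x1 (dependent); x3 is an independent fair bit.
joint = {(a, a, c): 0.25 for a in (0, 1) for c in (0, 1)}

h_joint = entropy(joint)  # H(X)
# Partition into singletons {x1}, {x2}, {x3}: ignores the dependence.
cost_singletons = sum(entropy(marginal(joint, [i])) for i in range(3))
# Partition {x1, x2}, {x3}: classes are mutually independent.
cost_grouped = entropy(marginal(joint, [0, 1])) + entropy(marginal(joint, [2]))
```

In this toy case the grouped partition attains the joint entropy, while the singleton partition pays for the redundant copy of `x1`.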

We can also view a table of n columns as a system of n variables. The relationship between certain 
compressors and entropy suggests that certain rearrangements that group functionally dependent columns 
will lead to better compression; Buchsbaum et al. [Q] observe this in practice while restricting attention to 
partitions that preserve the original order of columns. 

We are thus motivated to consider generally how to partition a system of variables optimally; i.e., to
achieve a partition $\mathcal{P}$ of $X$ that minimizes $\mathcal{H}(\mathcal{P})$, for some function $H(\cdot)$,
which we generally call the cost function. We introduce the following definitions. We call an element of
$\mathcal{P}$, which is a subset of $X$, a class. We define two variables or sets of variables $X$ and $X'$ to be
combinatorially dependent if $H(X, X') < H(X) + H(X')$; otherwise, $X$ and $X'$ are combinatorially independent.
When $H(\cdot)$ is the entropy function over random variables, combinatorial dependence becomes statistical
dependence. Considering unordered sets implies that $H(X, X') = H(X', X)$. Note that in general it is possible
that $H(X, X') > H(X) + H(X')$, although not when $H(\cdot)$ is the entropy function over random variables.
Finally, we define a class $Y$ to be contiguous if $x_i \in Y$ and $x_j \in Y$ for any $i < j$ implies that
$x_{i+1} \in Y$, and a partition $\mathcal{P}$ to be contiguous if each $Y \in \mathcal{P}$ is contiguous. We now
define two problems of finding optimal partitions of $X$.

Problem 2.1 Find a contiguous partition $\mathcal{P}$ of $X$ minimizing $\mathcal{H}(\mathcal{P})$ among all such
partitions.

Problem 2.2 Find a partition $\mathcal{P}$ of $X$ minimizing $\mathcal{H}(\mathcal{P})$ among all partitions.

Clearly, a solution to Problem 2.2 is at least as good in terms of cost as one to Problem 2.1. Problem 2.1
has a simple, fast algorithmic solution, however. Problem 2.2, while seemingly intractable, has an
algorithmic heuristic that seems to work well in practice.

Assume first that combinatorial dependence is an equivalence relation on X. This is not necessarily true 
in practice, but we study the idealized case to provide some intuition for handling real instances, when we 
cannot determine combinatorial dependence or even calculate the true cost function directly. 

Lemma 2.3 If combinatorial dependence is an equivalence relation on $X$, then the partition $\mathcal{P}$ of $X$
into equivalence classes $C_1, \ldots, C_k$ solves Problem 2.2.

Proof. Consider some partition $\mathcal{P}' \ne \mathcal{P}$; we show that
$\mathcal{H}(\mathcal{P}) \le \mathcal{H}(\mathcal{P}')$. Assume there exists a class $C' \in \mathcal{P}'$ such
that $C' \not\subseteq C_i$ for any $1 \le i \le k$. Partition $C'$ into subclasses $C'_1, \ldots, C'_\ell$ such
that for each $C'_j$ there is some $i$ such that $C'_j \subseteq C_i$. Let
$\mathcal{P}'' = (\mathcal{P}' \setminus \{C'\}) \cup \{C'_1, \ldots, C'_\ell\}$. Since the $C_i$'s are equivalence
classes, the $C'_j$'s are mutually independent, so $H(C') \ge \sum_{j=1}^{\ell} H(C'_j)$, which implies
$\mathcal{H}(\mathcal{P}'') \le \mathcal{H}(\mathcal{P}')$. Set $\mathcal{P}' \leftarrow \mathcal{P}''$, and iterate
until no such $C'$ exists in $\mathcal{P}'$.

If no such $C'$ exists in $\mathcal{P}'$, then either $\mathcal{P}' = \mathcal{P}$, and we are done, or else
$\mathcal{P}'$ contains two classes $C'$ and $D'$ such that $C' \cup D' \subseteq C_i$ for some $i$. The elements
in $C'$ and $D'$ are mutually dependent, so $H(C', D') < H(C') + H(D')$. Unite each such pair of classes until
$\mathcal{P}' = \mathcal{P}$. □

Lemma 2.3 gives a simple algorithm for solving Problem 2.2 when combinatorial dependence is an
equivalence relation that can be computed: partition $X$ according to the induced equivalence classes. When
combinatorial dependence is not an equivalence relation, or when we can only calculate $H(\cdot)$ heuristically,
we seek other approaches.

2.1 Solutions Without Reordering 

In the general case, irrespective of whether combinatorial dependence is an equivalence relation, we can
solve Problem 2.1 by dynamic programming. Let $E[i]$ be the cost of an optimal, contiguous partition of
variables $x_1, \ldots, x_i$. $E[n]$ is thus the cost of a solution to Problem 2.1. Define $E[0] = 0$; then, for
$1 \le i \le n$,

$$E[i] = \min_{0 \le j < i} E[j] + H(x_{j+1}, \ldots, x_i). \qquad (1)$$

The actual partition with cost $E[n]$ can be maintained by standard dynamic programming backtracking.
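
The recurrence in Equation (1) can be sketched as follows. This is an illustrative sketch, not the implementation used in the experiments: the function name `optimal_contiguous_partition`, the `cost(j, i)` callback standing for $H(x_{j+1}, \ldots, x_i)$, and the toy cost function are all assumptions introduced here.

```python
from typing import Callable, List, Tuple

def optimal_contiguous_partition(
    n: int, cost: Callable[[int, int], float]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Solve Equation (1). cost(j, i) stands for H(x_{j+1}, ..., x_i), 0 <= j < i <= n.

    Returns E[n] and the classes as inclusive 1-based intervals (first, last).
    """
    E = [0.0] + [float("inf")] * n   # E[0] = 0
    back = [0] * (n + 1)             # back[i] = the j that attains the minimum for E[i]
    for i in range(1, n + 1):
        for j in range(i):
            candidate = E[j] + cost(j, i)
            if candidate < E[i]:
                E[i], back[i] = candidate, j
    # Standard backtracking: walk from n to 0 through the stored minimizers.
    classes = []
    i = n
    while i > 0:
        j = back[i]
        classes.append((j + 1, i))
        i = j
    classes.reverse()
    return E[n], classes

# Hypothetical cost: variables 1-2 form one dependent group, 3-4 another.
groups = ["a", "a", "b", "b"]

def toy_cost(j: int, i: int) -> float:
    members = groups[j:i]            # x_{j+1}, ..., x_i
    if len(members) == 1:
        return 1.0                   # a single variable
    if len(set(members)) == 1:
        return 1.2                   # dependent variables combine cheaply
    return len(members) + 0.5        # mixing groups costs more than splitting

total, classes = optimal_contiguous_partition(4, toy_cost)
```

The double loop evaluates `cost` once per interval, which is where the quadratic number of cost evaluations comes from.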

If combinatorial dependence actually is an equivalence relation and all dependent variables appear
contiguously in $X$, a simple greedy algorithm also solves the problem. Start with class $C_1 = \{x_1\}$. In
general, let $i$ be the index of the current class and $j$ be the index of the variable most recently added to
$C_i$. While $j < n$, iterate as follows. If $H(C_i \cup \{x_{j+1}\}) < H(C_i) + H(x_{j+1})$, then set
$C_i \leftarrow C_i \cup \{x_{j+1}\}$; otherwise, start a new class, $C_{i+1} = \{x_{j+1}\}$. An alternative
algorithm assigns, for $1 \le i < n$, $x_i$ and $x_{i+1}$ to the same class if and only if
$H(x_i, x_{i+1}) < H(x_i) + H(x_{i+1})$. We call the resulting partition a greedy partition; formally, a greedy
partition is one in which each class is a maximal, contiguous set of mutually dependent variables.
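
Both greedy variants described above can be sketched with an abstract cost function. The names `greedy_partition` and `pairwise_greedy_partition`, and the toy cost `H` that counts distinct groups, are hypothetical and serve only to illustrate the two rules.

```python
from typing import Callable, List, Tuple

Cost = Callable[[Tuple[int, ...]], float]

def greedy_partition(n: int, H: Cost) -> List[List[int]]:
    """Grow the current class while the next variable is dependent on the whole class."""
    classes = [[1]]
    for nxt in range(2, n + 1):
        current = classes[-1]
        if H(tuple(current) + (nxt,)) < H(tuple(current)) + H((nxt,)):
            current.append(nxt)        # dependent: extend the current class
        else:
            classes.append([nxt])      # independent: start a new class
    return classes

def pairwise_greedy_partition(n: int, H: Cost) -> List[List[int]]:
    """Join x_i and x_{i+1} whenever that adjacent pair alone is dependent."""
    classes = [[1]]
    for i in range(1, n):
        if H((i, i + 1)) < H((i,)) + H((i + 1,)):
            classes[-1].append(i + 1)
        else:
            classes.append([i + 1])
    return classes

# Hypothetical cost: variables in the same group are copies of each other,
# so the cost of a set is the number of distinct groups it touches.
group_of = {1: "a", 2: "a", 3: "a", 4: "b", 5: "c", 6: "c"}

def H(variables: Tuple[int, ...]) -> float:
    return float(len({group_of[v] for v in variables}))

by_class = greedy_partition(6, H)
by_pairs = pairwise_greedy_partition(6, H)
```

With this toy cost, dependence is an equivalence relation and dependent variables are contiguous, so the two variants agree.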



Lemma 2.4 If combinatorial dependence is an equivalence relation and all combinatorially dependent
variables appear contiguously in $X$, then the greedy partition solves Problems 2.1 and 2.2.

Proof. By assumption, the classes in a greedy partition correspond to the equivalence classes of $X$. Lemma
2.3 thus shows that the greedy partition solves Problem 2.2. Contiguity therefore implies it also solves
Problem 2.1. □



2.2 Solutions with Reordering 



Problem 2.2 asks for the best way to partition the variables in $X$, ignoring contiguity constraints. While a
general solution seems intractable, we give a combinatorial approach that admits a practical heuristic.

Define a weighted, complete, undirected graph, $G(X)$, with a vertex for each $x_i \in X$; the weight of edge
$\{x_i, x_j\}$ is $w(x_i, x_j) = \min(H(x_i, x_j), H(x_i) + H(x_j))$. Let $P = (v_0, \ldots, v_\ell)$ be any path
in $G(X)$. The weight of $P$ is $w(P) = \sum_{i=0}^{\ell-1} w(v_i, v_{i+1})$. We apply the cost function
$H(\cdot)$ to define the cost of $P$. Consider removing all edges $\{u, v\}$ from $P$ such that $u$ and $v$ are
combinatorially independent. This leaves a set of disjoint paths, $S(P) = \{P_1, \ldots, P_k\}$ for some $k$. We
define the cost of $P$ to be $\mathcal{H}(P) = \sum_{i=1}^{k} H(P_i)$, where $P_i$ is taken to be the unordered
set of vertices in the corresponding subpath. If $P$ is a tour of $G(X)$, then $S(P)$ corresponds to a partition
of $X$.
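
The weight and cost of a path can be computed side by side, as in the sketch below. It assumes the tour is given as an open path that visits every vertex once (no closing edge is added), and it uses a hypothetical toy cost `H`; the function name `weight_and_cost` is likewise an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Cost = Callable[[Tuple[int, ...]], float]

def weight_and_cost(path: Sequence[int], H: Cost):
    """Return (w(P), S(P), cost of P) for a path P given as a vertex sequence."""
    def w(u: int, v: int) -> float:
        return min(H((u, v)), H((u,)) + H((v,)))

    def dependent(u: int, v: int) -> bool:
        return H((u, v)) < H((u,)) + H((v,))

    edges = list(zip(path, path[1:]))
    weight = sum(w(u, v) for u, v in edges)

    # Remove edges between independent endpoints; what remains is S(P).
    subpaths: List[List[int]] = [[path[0]]]
    for u, v in edges:
        if dependent(u, v):
            subpaths[-1].append(v)
        else:
            subpaths.append([v])

    # Each subpath is costed as an unordered set of vertices.
    cost = sum(H(tuple(sorted(p))) for p in subpaths)
    return weight, subpaths, cost

# Hypothetical cost: the number of distinct groups among the variables.
group_of = {1: "a", 2: "a", 3: "a", 4: "b", 5: "c", 6: "c"}

def H(variables: Tuple[int, ...]) -> float:
    return float(len({group_of[v] for v in variables}))

scattered = weight_and_cost([1, 4, 2, 5, 3, 6], H)   # dependent variables kept apart
grouped = weight_and_cost([1, 2, 3, 4, 5, 6], H)     # dependent variables adjacent
```

In this toy instance the path that places dependent variables next to each other has both the smaller weight and the smaller cost.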

We establish a relationship between the cost and weight of a tour $P$. Assume there are two distinct paths
$P_i = (u_0, \ldots, u_k)$ and $P_j = (v_0, \ldots, v_\ell)$ in $S(P)$ such that $u_k$ and $v_0$ are
combinatorially dependent and $v_0$ follows $u_k$ in $P$. In $P$ exist the edges $\{u_k, x\}$, $\{y, v_0\}$, and
$\{v_\ell, z\}$. We can transform $P$ into a new tour $P'$ that unites $P_i$ and $P_j$ by substituting for these
three edges the new edges: $\{u_k, v_0\}$, $\{v_\ell, x\}$, and $\{y, z\}$. We call this a path coalescing
transformation. The following lemma shows that it is a restricted form of the standard traveling salesman
3-opt transformation, in that it always reduces the cost of a tour. It is restricted by the stipulation that
$u_k$ and $v_0$ be combinatorially dependent.



Lemma 2.5 If $P'$ is formed from $P$ by a path coalescing transformation, then $w(P') < w(P)$.

Proof. Consider

$$w(u_k, x) + w(y, v_0) + w(v_\ell, z) \qquad (2)$$

and

$$w(u_k, v_0) + w(v_\ell, x) + w(y, z). \qquad (3)$$

We have $w(P') - w(P) = (3) - (2)$. The definition of $S(P)$ implies that
$(2) = H(u_k) + H(x) + H(y) + H(v_0) + H(v_\ell) + H(z)$. That $u_k$ and $v_0$ are combinatorially dependent
implies $w(u_k, v_0) < H(u_k) + H(v_0)$. Since $w(X, Y) \le H(X) + H(Y)$ for any $X$ and $Y$, we conclude that
$(3) < (2)$. □

Repeated path coalescing groups combinatorially dependent variables. If a tour $P$ admits no path
coalescing transformation, and if combinatorial dependence is an equivalence relation on $X$, then we can
conclude that $P$ is optimal by Lemma 2.3. That is, $S(P)$ corresponds to an optimal partition of $X$, which
solves Problem 2.2. Furthermore, Lemma 2.5 implies that a minimum weight tour $P$ admits no path
coalescing transformation.

When $H(\cdot)$ is sub-additive, i.e., $H(X, Y) \le H(X) + H(Y)$, as is the entropy function, a sequence of
path coalescing transformations yields a sequence of paths of non-increasing costs. That is, in Lemma 2.5,
$w(P') < w(P)$ and $\mathcal{H}(P') \le \mathcal{H}(P)$. We explore this connection between the two functions
below, when we do not assume that combinatorial dependence is an equivalence relation or even that $H(\cdot)$
is sub-additive.



3 Partitions of Tables and Compression 

We apply the results of Section 2 to table compression. Let $T$ be a table of $n = |T|$ columns and some fixed,
arbitrary number of rows. Let $T[i]$ denote the $i$'th column of $T$. Given two tables $T_1$ and $T_2$, let
$T_1T_2$ be the table formed by their juxtaposition. That is, $T = T_1T_2$ is defined so that $T[i] = T_1[i]$ for
$1 \le i \le |T_1|$ and $T[i] = T_2[i - |T_1|]$ for $|T_1| < i \le |T_1| + |T_2|$. Any column is a one-column
table, so $T[i]T[j]$ is the table formed by projecting the $i$'th and $j$'th columns of $T$; and so on. We use the
shorthand $T[i, j]$ to represent the projection $T[i] \cdots T[j]$ for some $j \ge i$.



Fix a compressor $C$: e.g., gzip, based on LZ77 [22]; compress, based on LZ78 [20, 23]; or bzip, based
on Burrows-Wheeler [5]. Let $H_C(T)$ be the size of the result of compressing table $T$ as a string in
row-major order using $C$. Let $H_C(T_1, T_2) = H_C(T_1T_2)$. $H_C(\cdot)$ is a cost function as discussed in
Section 2, and the definitions of combinatorial dependence and independence apply to tables. In particular, two
tables $T_1$ and $T_2$, which might be projections of columns from a common table $T$, are combinatorially
dependent if $H_C(T_1, T_2) < H_C(T_1) + H_C(T_2)$ (i.e., if compressing them together is better than compressing
them separately) and combinatorially independent otherwise.
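
The cost function $H_C$ and the dependence test can be sketched with Python's standard `zlib` module, which implements DEFLATE, the LZ77-based method underlying gzip. This is an illustration under stated assumptions, not the pin/pzip system: a table is represented as a list of equal-length `bytes` columns with one byte per field, and the names `row_major`, `H_C`, and `combinatorially_dependent` are introduced here.

```python
import zlib
from typing import List

Table = List[bytes]  # assumed layout: one bytes object per column, one byte per row

def row_major(columns: Table) -> bytes:
    """Scan the table row by row, emitting the fields of each row in column order."""
    return bytes(field for row in zip(*columns) for field in row)

def H_C(columns: Table, level: int = 9) -> int:
    """Size in bytes of the zlib-compressed row-major string of the table."""
    return len(zlib.compress(row_major(columns), level))

def combinatorially_dependent(t1: Table, t2: Table) -> bool:
    """True if compressing the juxtaposition T1T2 beats compressing T1 and T2 apart."""
    return H_C(t1 + t2) < H_C(t1) + H_C(t2)
```

Because `H_C(t1 + t2)` and `H_C(t2 + t1)` compress different strings, this test need not be symmetric, which is one of the issues discussed next.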



Problems 2.1 and 2.2 now apply to compressing $T$. Problem 2.1 is to find a contiguous partition of $T$
into intervals of columns minimizing the overall cost of compressing each interval separately. Problem 2.2
is to find a partition of $T$, allowing columns to be reordered, minimizing the overall cost of compressing
each interval separately. Buchsbaum et al. [Q] address Problem 2.1 experimentally and leave Problem 2.2
open save for some heuristic observations.

A few major issues arise in this application. Combinatorial dependence is not necessarily an equivalence 
relation. It is not necessarily even symmetric, so we can no longer ignore the order of columns in a class. 
Also, $H_C(\cdot)$ need not be sub-additive. If $C$ behaves according to entropy, however, then intuition suggests
that our partitioning strategies will improve compression. Stated conversely, if $H_C(T)$ is far from $H(T)$,
the entropy of $T$, there should be some partition $\mathcal{P}$ of $T$ so that $\mathcal{H}_C(\mathcal{P})$
approaches $H(T)$, which is a lower bound on $H_C(T)$. We will present algorithms for solving these problems and
experiments assessing their performance.

3.1 Algorithms for Table Compression without Rearrangement of Columns 

The dynamic programming solution in Equation (1) finds an optimal, contiguous partition solving Problem
2.1. Buchsbaum et al. [Q] demonstrate experimentally that it effectively improves compression results, and
we will use their method as a benchmark.

The dynamic program, however, requires $\Theta(n^2)$ steps, each applying $C$ to an average of $O(n)$ columns,
for a total of $\Theta(n^3)$ column compressions. In the off-line training paradigm, this optimization time can
be ignored. Faster algorithms, however, might allow some partitioning to be applied when compressing single,
tabular files in addition to continuously generated tables.
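
The cubic count can be made explicit with a short derivation, assuming each candidate interval $T[j+1, i]$ with $0 \le j < i \le n$ is compressed exactly once. The number of intervals is

$$\sum_{i=1}^{n} \sum_{j=0}^{i-1} 1 = \frac{n(n+1)}{2} = \Theta(n^2),$$

and the total number of columns passed to $C$ is

$$\sum_{i=1}^{n} \sum_{j=0}^{i-1} (i - j) = \sum_{i=1}^{n} \frac{i(i+1)}{2} = \frac{n(n+1)(n+2)}{6} = \Theta(n^3),$$

so the average interval spans $(n+2)/3$ columns.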



The greedy algorithms from Section 2.1 apply directly in our framework. We denote by GREEDY the
algorithm that grows class $C_i$ incrementally by comparing $H_C(C_i T[j+1])$ and $H_C(C_i) + H_C(T[j+1])$. We
denote by GREEDYT the algorithm that assigns $T[i]$ and $T[i+1]$ to the same class when
$H_C(T[i, i+1]) < H_C(T[i]) + H_C(T[i+1])$.

GREEDY performs $2(n-1)$ compressions, each of $O(n)$ columns, for a total of $\Theta(n^2)$ column
compressions. GREEDYT performs $2(n-1)$ compressions, each of one or two columns, for a total of $\Theta(n)$
column compressions, asymptotically at least as fast as applying $C$ to $T$ itself.

Even though combinatorial dependence is not an equivalence relation, we hypothesize that GREEDY
and GREEDYT will produce partitions close in cost to the optimal contiguous partition produced by the
dynamic program. We present experimental results testing this hypothesis in Section 4.2.
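
A sketch of GREEDYT applied to an actual table is given below. It is a stand-in under assumptions, not the code used in the experiments: `zlib` replaces gzip, columns are equal-length `bytes` objects with one byte per field, and the names `compressed_size` and `greedyt` are hypothetical. Single-column sizes are computed once and reused, so the sketch performs $n$ one-column and $n-1$ two-column compressions.

```python
import zlib
from typing import List, Sequence

def compressed_size(columns: Sequence[bytes]) -> int:
    """zlib-compressed size of the given columns scanned in row-major order."""
    data = bytes(field for row in zip(*columns) for field in row)
    return len(zlib.compress(data, 9))

def greedyt(columns: Sequence[bytes]) -> List[List[int]]:
    """Partition 0-based column indices by testing each adjacent pair only."""
    single = [compressed_size([c]) for c in columns]
    classes = [[0]]
    for i in range(len(columns) - 1):
        together = compressed_size([columns[i], columns[i + 1]])
        if together < single[i] + single[i + 1]:
            classes[-1].append(i + 1)   # T[i] and T[i+1] share a class
        else:
            classes.append([i + 1])     # start a new class at T[i+1]
    return classes
```

Each returned class is a contiguous interval of columns, so the result can be handed directly to a compressor that processes one interval at a time.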



3.2 Algorithms for Table Compression with Rearrangement of Columns 

We now consider Problem 2.2. Assuming that combinatorial dependence is not an equivalence relation, to
the best of our knowledge, the only known algorithm to solve it exactly consists of generating all $n!$ column
orderings and applying the dynamic program in Equation (1) to each. The relationship between compression
and entropy, however, suggests that the approach in Section 2.2 can still be fruitfully applied.

Recall that in the idealized case, an optimal solution corresponds to a tour of $G(T)$ that admits no path
coalescing transformation. Furthermore, such transformations always reduce the weight of such tours. The
lack of symmetry in $H_C(\cdot)$ further suggests that order within classes is important: it no longer suffices to
coalesce paths globally.

We therefore hypothesize a strong, positive correlation between tour weight and compression cost. This
would imply that a traveling salesman (TSP) tour of $G(T)$ would yield an optimal or near-optimal partition
of $T$. To test this hypothesis, we generate a set of tours of various weights, by iteratively applying
standard optimizations (e.g., 3-opt, 4-opt). Each tour induces an ordering of the columns, which we optimally
partition using the dynamic program. We present results of this experiment in Section 4.3.
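
The pipeline from table to column ordering can be sketched as follows. Several pieces are assumptions made for illustration: `zlib` stands in for gzip, the tour is treated as an open path over the columns, and a simple nearest-neighbor heuristic replaces the 3-opt and 4-opt optimizations named above (it is not one of the tour-construction methods used in this paper). The names `weight_matrix`, `tour_weight`, and `nearest_neighbor_order` are hypothetical.

```python
import zlib
from typing import List, Sequence

def compressed_size(columns: Sequence[bytes]) -> int:
    data = bytes(field for row in zip(*columns) for field in row)
    return len(zlib.compress(data, 9))

def weight_matrix(columns: Sequence[bytes]) -> List[List[int]]:
    """w[i][j] = min(H_C(T[i]T[j]), H_C(T[i]) + H_C(T[j])); asymmetric in general."""
    n = len(columns)
    single = [compressed_size([c]) for c in columns]
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                pair = compressed_size([columns[i], columns[j]])
                w[i][j] = min(pair, single[i] + single[j])
    return w

def tour_weight(order: Sequence[int], w: Sequence[Sequence[int]]) -> int:
    """Sum of the weights of consecutive (directed) edges along the ordering."""
    return sum(w[a][b] for a, b in zip(order, order[1:]))

def nearest_neighbor_order(w: Sequence[Sequence[int]], start: int = 0) -> List[int]:
    """Stand-in heuristic: repeatedly move to the cheapest unvisited column."""
    n = len(w)
    order, seen = [start], {start}
    while len(order) < n:
        last = order[-1]
        nxt = min((j for j in range(n) if j not in seen), key=lambda j: w[last][j])
        order.append(nxt)
        seen.add(nxt)
    return order
```

Each ordering produced this way would then be partitioned by the dynamic program, and the resulting compressed size compared against `tour_weight` to measure the hypothesized correlation.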



4 Experiments 
4.1 Data 

We report experimental results on several data sets. The first three of the following are used by Buchsbaum
et al.

CARE is a collection of 90-byte records from a customer care database of voice call activity. 

NETWORK is a collection of 32-byte records from a system of network status monitors. 

CENSUS is a portion of the United States 1990 Census of Population and Housing Summary Tape File 3A
[^]. We used field group 301, level 090, for all states. Each record is 932 bytes.

LERG is a file from Telcordia's database describing local telephone switches. We appended spaces as nec- 
essary to pad each record to a uniform 30 bytes. 

We also use several files from genetic databases, which are growing at a fast pace and pose unique
challenges to compression systems [11, 17]. These files can be viewed as two-dimensional, alphanumeric
tables representing multiple alignments of proteins (amino acid sequences) and genomic coding regions
(DNA sequences).

The files EGF, LRR, PF00032, BACKPQQ, CALLAGEN, and CBS come from the Pfam database of
multiple alignments of protein domains or conserved protein functions [^]. Its main function is to store
information that can be used to determine whether a new protein belongs to an existing domain or family. It
contains more than 1800 protein families and has many mirror sites. The size of each table can range from a
few columns and rows to hundreds of columns and thousands of rows. We have chosen multiple alignments
of different sizes and representing protein domains with differing degrees of conservation: i.e., how close
two members of a family are in terms of matching characters in the alignment.

Table 1: Files used in our experiments. Bpr is bytes per record. Size is the original size of the file in bytes.
Training size is the ratio of the size of the training set to that of the test set. Gzip and DP report compression
results; DP is the optimal contiguous partition, calculated by dynamic programming. For each, Size is the
size of the compressed file in bytes, and Rate is the ratio of compressed to original size. DP/Gzip shows the
relative improvement yielded by partitioning.

The file CYTOB is from the AMmtDB database of multi-aligned sequences of Vertebrate mitochondrial 
genes for coding proteins [16]. It contains data from 888 different species and over 1100 multi-alignments of 
protein-coding genes. The tables corresponding to the alignments tend to have rows in the order of hundreds 
and columns in the order of thousands, much wider than the other files we consider. We have experimented 
with one multiple alignment: CYTOB represents the coding region of the mitochondrial gene (from 500
different species) of cytochrome b.

Table 1 details the sizes of the files and how well gzip and the optimal partition via dynamic programming
(using gzip as the underlying compressor) compress them. We use the pin/pzip system described by
Buchsbaum et al. [Q] to generate optimal, contiguous partitions. For each file, we run the dynamic program
on a small training set and compress the remainder of the data, the test set. Gzip results are with respect to
the test sets only. Buchsbaum et al. [Q] investigate the relationship between training size and compression
performance and demonstrate a threshold after which more training data does not improve performance.
Here we simply use enough training data to exceed this threshold and report this amount in Table 1. The
training and test sets remain disjoint to support the validity of using a partition from a small amount of
training data on a larger amount of subsequent data. In a real application, the training data would also be
compressed.

All experiments were performed on one 250 MHz R10000 processor in a 24-processor SGI Challenge,
with 14 GB of main memory. Each time reported is the median of five runs.

Table 2: Performance of GREEDY and GREEDYT. For each, Size is the size of the compressed file using
the corresponding partition; Rate is the corresponding compression rate; /Gzip is the size relative to gzip;
and /DP is the size relative to using the optimal, contiguous partition.

                      GREEDY                           GREEDYT
File          Size    Rate   /Gzip     /DP      Size    Rate   /Gzip     /DP
[...]        [...]   [...]   [...]   [...]     [...]  0.0650  0.7046   [...]
LERG        197821  0.0568  0.4348  1.0644    199246  0.0573  0.4379  1.0720
EGF          57016  0.1068  0.7885  1.0079     61178  0.1146  0.8461  1.0814
LRR          49778  0.2114  0.8062  1.0148     49393  0.2098  0.8000  1.0069
PF00032      31037  0.0771  0.9069  1.0147     31390  0.0780  0.9172  1.0263
BACKPQQ       7761  0.3472  1.0337  1.0800      7761  0.3472  1.0337  1.0800
CALLAGEN     58952  0.2428  0.8755  0.9934     56313  0.2319  0.8363  0.9489
CBS          21571  0.2922  0.9295  1.0873     21939  0.2971  0.9454  1.1059
CYTOB        94128  0.1625  0.8582  1.0461    113160  0.1953  1.0317  1.2576



4.2 Greedy Algorithms 

Our hypothesis that GREEDY and GREEDYT produce partitions close in cost to that of the optimal,
contiguous partition, if true, implies that we can substitute the greedy algorithms for the dynamic program (DP)
in purely on-line applications that cannot afford off-line training time. We thus compare compression rates
of GREEDY and GREEDYT against DP and gzip, to assess the quality of the partitions; and we compare the
time taken by GREEDY and GREEDYT (partitioning and compression) against gzip, to assess tractability.
Table 2 shows the resulting compressed sizes using partitions computed with GREEDY and GREEDYT.
Table 3 gives the time results.

GREEDY compresses to within 2% of DP on seven of the files, including four of the genetic files. It 
is never more than 9% bigger than DP, and with the exception of BACKPQQ, always outperforms gzip. 
GREEDYT comes within 10% of DP on seven files, including four genetic files, and outperforms gzip
except on BACKPQQ and CYTOB. Both GREEDY and GREEDYT seem to outperform DP on CALLAGEN,
although this would seem theoretically impossible. It is an artifact of the training/testing paradigm: we 
compress data distinct from that used to build the partitions. 

Tables 2 and 3 show that in many cases, the greedy algorithms provide significant extra compression at
acceptable time penalties. For the non-genetic files, greedy partitioning compression is less than 1.7 times
slower than gzip yet provides 35-55% more compression. For the genetic files, the slowdown is a factor of 3-8, and
the extra compression is 5-20% (ignoring BACKPQQ). Thus, the greedy algorithms provide a good on-line 
heuristic for improving compression. 

4.3 Reordering via TSP 

Our hypothesis that tour weight and compression are correlated implies that generating a TSP tour (or 
approximation) would yield an optimal (or near optimal) partition. Although we do not know what the 
optimal partition is for our files, we can assess the correlation by generating a sequence of tours and, for 
each, measuring the resulting compression. We also compare the compression using the best partition from 
the sequence against that using DP on the original ordering, to gauge the improvement yielded by reordering. 
For each file, we computed various tours on the corresponding graph $G(\cdot)$. We computed a close
approximation to a TSP tour using a variation of Zhang's branch-and-bound algorithm [21], discussed by Cirasella
et al. [0]. We also computed a 3-opt local optimum tour; and we used a 4-opt heuristic to compute a
sequence of tours of various costs. Each tour induced an ordering of the columns. For each column ordering,
we computed the optimal, contiguous partition by DP, except that we used GREEDYT on the orderings for
CENSUS, due to computational limitations. Figures 1 and 2 plot the results.

Table 3: On-line performance of GREEDY and GREEDYT. For each, Time is the time in seconds to compute
the partition and compress the file; /Gzip is the time relative to gzip.

               Gzip          GREEDY               GREEDYT
File           Time       Time     /Gzip       Time     /Gzip
CARE         5.0260     7.1020    1.4131     6.4340    1.2801
NETWORK     15.0000    25.3790    1.6919    24.2750    1.6183
CENSUS     126.6450   160.7960    1.2697   147.1980    1.1623
LERG         1.5730     2.2800    1.4495     2.3080    1.4673
EGF          0.2350     0.8030    3.4170     0.7250    3.0851
LRR          0.1260     0.4530    3.5952     0.4450    3.5317
PF00032      0.1320     0.8950    6.7803     0.6290    4.7652
BACKPQQ      0.0180     0.3090   17.1667     0.3260   18.1111
CALLAGEN     0.2500     0.6050    2.4200     0.5300    2.1200
CBS          0.0530     0.4260    8.0377     0.4020    7.5849
CYTOB        0.8230     3.7330    4.5358     2.1830    2.6525

The plots demonstrate a strong, positive correlation between tour cost and compression performance.
In particular, each plot shows that the least-cost tour (produced by Zhang's algorithm) produced the best
compression result. Table 4 details the compression improvement from using the Zhang ordering. In five
files, Zhang gives an extra compression improvement of at least 5% over DP on the original order; for
CYTOB, the improvement is 20%. That the original order for NETWORK outperforms the Zhang ordering is
again an artifact of the training/test paradigm. Figure 1 shows that the tour-cost/compression-performance
correlation remains strong for this file.

Table 5 displays the time spent computing Zhang's tour for each file. This time is negligible compared to
the time to compute the optimal, contiguous partition via DP. (The DP time on CENSUS is 168531 seconds,
four orders of magnitude larger. For CYTOB, the DP time is 8640 seconds, an order of magnitude larger.)
Table 5 also shows that Zhang's tour always had cost close to the Held-Karp lower bound [|l^, |l^] on the
cost of the optimum TSP tour.

For off-line training, therefore, it seems that computing a good approximation to the TSP reordering
before partitioning contributes significant compression improvement at minimal time cost. Furthermore, the
correlation between tour cost and compression behaves similarly to what the theory in Section 2 would
predict if $H_C(\cdot)$ were sub-additive, which suggests the existence of some other, similar structure induced by
$H_C(\cdot)$ that would control this relationship.



5 Complexity of Table Compression: A General Framework 

We now introduce a framework for studying the computational complexity of several versions of table 
compression problems. We start with a basic problem of finding an optimal arrangement of a set of strings 
to be compressed. Given a set of strings, we wish to compute an order in which to catenate the strings into 
a superstring X so as to minimize the cost of compressing X using a fixed compressor C. To isolate the
complexity of finding an optimal order, we restrict C to prevent it from reordering the input itself.

Figure 1: Relationship between tour cost (x-axes) and compression size (y-axes) for CARE, NETWORK,
CENSUS, LERG, EGF, and LRR, using the result of Zhang's algorithm, a 3-opt local optimum, and a sequence
of tours generated by a series of 4-opt changes.

Figure 2: Relationship between tour cost (x-axes) and compression size (y-axes) for PF00032, backPQQ,
CALLAGEN, CBS, and CYTOB, using the result of Zhang's algorithm, a 3-opt local optimum, and a sequence
of tours generated by a series of 4-opt changes.

Table 4: Performance of TSP reordering. For each, Size is the size of the compressed file using the Zhang
ordering and optimal, contiguous partition (for CENSUS, using the GREEDYT partition); Rate is the
corresponding compression rate; /Gzip is the size relative to gzip; and /DP is the size relative to using the optimal,
contiguous partition on the original ordering.

                       TSP
File               Size      Rate     /Gzip       /DP
CARE            1199315    0.1466    0.5890    0.9290
NETWORK         1822065    0.0299    0.4859    1.0249
CENSUS         18113740    0.0544    0.5901    0.8419
LERG             183668    0.0528    0.4037    0.9882
EGF               50027    0.0937    0.6919    0.8843
LRR               48139    0.2045    0.7796    0.9814
PF00032           29625    0.0736    0.8656    0.9685
backPQQ            7131    0.3190    0.9498    0.9923
CALLAGEN          51249    0.2111    0.7611    0.8636
CBS               19092    0.2586    0.8227    0.9623
CYTOB             71529    0.1234    0.6522    0.7947

Table 5: For each file, the quality of Zhang's tour is expressed as per cent above the Held-Karp lower bound.
Time is the time in seconds to compute the tour.

File           % above HK       Time
CARE                0.438      0.110
NETWORK             0.602      0.230
CENSUS              0.177     28.500
LERG                0.011      0.010
EGF                 0.314      0.450
LRR                 0.354      0.050
PF00032             0.211      0.510
backPQQ             0.196      0.050
CALLAGEN            0.152      0.170
CBS                 0.187      0.210
CYTOB               0.027    735.440

Let $x = \sigma_1 \cdots \sigma_n$ be a string over some alphabet $\Sigma$, and let $C(x)$ denote the output of $C$ when given
input $x$. We allow $C$ arbitrary time and space, but we require that it process $x$ monotonically. That is, it
reads the symbols of $x$ in order; after reading each symbol, it may or may not output a string. Let $C(x)_j$
be the catenation of all the strings output, in order, by $C$ after processing $\sigma_1 \cdots \sigma_j$. If $C$ actually outputs a
(non-null) string after reading $\sigma_j$, then we require that $C(x)_j$ be a prefix of $C(\sigma_1 \cdots \sigma_j y)$ for any suffix
$y$. We assume a special end-of-string character not in $\Sigma$ that implicitly terminates every input to $C$.

Intuitively, this restriction precludes C from reordering its input to improve the compression. Many 
compression programs used in practice work within this restriction: e.g., gzip and compress. 

We use $|C(x)|$ to abstract the length of $C(x)$. A common measure is bits, but other measures are more
appropriate in certain settings. For example, when considering LZ77 compression [22], we will denote by
$|C(x)|$ the number of phrases in the LZ77 parsing of $x$, which suffices to capture the length of $C(x)$ while
ignoring technical details concerning how phrases are encoded.

Let $X = \{x_1, \ldots, x_n\}$ be a set of strings. A batch of $X$ is an ordered subset of $X$. A schedule of
$X$ is a partition of $X$ into batches. A batch $B = (x_{i_1}, \ldots, x_{i_s})$ is processed by $C$ by computing
$C(B) = C(x_{i_1} \cdots x_{i_s})$; i.e., by compressing the superstring formed by catenating the strings in $B$ in the order given.
A schedule $S$ of $X$ is processed by $C$ by processing its batches, one by one, in any order. While $C(S)$ is
ambiguous, $|C(S)| = \sum_{B \in S} |C(B)|$ is well defined. Our main problem can be stated as follows.

Problem 5.1 Let $X$ be a set of strings. Find a schedule $S$ of $X$ minimizing $|C(S)|$ among all schedules.
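
The cost $|C(S)|$ of a schedule can be illustrated with a minimal Python sketch. It assumes zlib as a stand-in for the compressor $C$ and bytes as the length measure; the function names and sample strings are hypothetical and not part of the paper's experiments.

```python
import zlib

def compressed_size(data: bytes) -> int:
    # |C(x)| in bytes, with zlib standing in for the compressor C (an assumption)
    return len(zlib.compress(data, 9))

def schedule_cost(schedule):
    # |C(S)|: sum of |C(B)| over the batches B of the schedule;
    # each batch is an ordered list of byte strings, catenated before compression
    return sum(compressed_size(b"".join(batch)) for batch in schedule)

x1 = b"abcabcabc" * 50
x2 = b"abcabcabc" * 50
x3 = bytes(range(256))

together = [[x1, x2], [x3]]   # x1 and x2 share one batch
apart = [[x1], [x2], [x3]]    # every string in its own batch

print(schedule_cost(together), schedule_cost(apart))
```

Putting the two similar strings in one batch lets the compressor reuse what it has already seen, which is the effect a good schedule tries to exploit.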



The classical shortest common superstring (SCS) problem can be phrased in terms of Problem 5.1. For
two strings $x$ and $y$, let $pref(x, y)$ be the prefix of $x$ that ends at the longest suffix-prefix match of $x$ and
$y$. Let $X$ be a set of $n$ strings, and let $\pi$ be a permutation of the integers in $[1, n]$. Define $S(X, \pi) =
pref(x_{\pi_1}, x_{\pi_2}) \, pref(x_{\pi_2}, x_{\pi_3}) \cdots pref(x_{\pi_{n-1}}, x_{\pi_n}) \, x_{\pi_n}$. $S(X, \pi)$ is a superstring of $X$; $\pi$ corresponds to a
schedule of $X$; and the SCS of $X$ is $S(X, \pi)$ for some $\pi$ jr2|]. Therefore, finding the SCS is an instance of
Problem 5.1, where $C(\cdot)$ is $S(\cdot)$. Since finding the SCS is MAX-SNP hard, Problem 5.1 is MAX-SNP
hard in general. Different results can hold for specific compressors, however.
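
The definitions of $pref$ and $S(X, \pi)$ can be sketched in Python as follows. This is only an illustration under the assumption that strings are ordinary Python `str` values and that the permutation is given as a list of 0-based indices; the helper names are hypothetical.

```python
def overlap(x: str, y: str) -> int:
    # length of the longest suffix of x that is also a prefix of y
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return k
    return 0

def pref(x: str, y: str) -> str:
    # prefix of x that ends where the longest suffix-prefix match with y begins
    return x[:len(x) - overlap(x, y)]

def superstring(strings, pi):
    # S(X, pi): catenate pref of consecutive strings, then the last string in full
    parts = [pref(strings[pi[i]], strings[pi[i + 1]]) for i in range(len(pi) - 1)]
    return "".join(parts) + strings[pi[-1]]

X = ["abc", "bcd", "cde"]
print(superstring(X, [0, 1, 2]))  # overlaps are used
print(superstring(X, [2, 1, 0]))  # no overlaps in this order
```

The order given by the permutation determines how much overlap is saved, which is why choosing the order is the hard part.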

We now formalize table compression problems in this framework. Consider a table $T$ with $m$ rows and
$n$ columns, each entry a symbol in $\Sigma$. Let $T^c$ be the string formed by catenating the columns of $T$ in order;
let $T^r$ be the string formed by catenating the rows of $T$ in order.

We view $T$ as a set of columns $\{T[1], \ldots, T[n]\}$. A batch $B = (T[i_1], \ldots, T[i_s])$ then corresponds to
a table $T_B = T[i_1] \cdots T[i_s]$, which we can compress in column- or row-major order. A column-major order
schedule $S^c$ of $T$ has compression cost $|S^c| = \sum_{B \in S^c} |C(T_B^c)|$. A row-major order schedule $S^r$ of $T$ has
compression cost $|S^r| = \sum_{B \in S^r} |C(T_B^r)|$.

Problem 5.2 Given a table $T$, find a column-major schedule $S^c$ of $T$ minimizing $|S^c(T)|$ among all such
schedules.

Problem 5.3 Given a table $T$, find a row-major schedule $S^r$ of $T$ minimizing $|S^r(T)|$ among all such
schedules.
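
The two renderings of a batch can be made concrete with a small Python sketch. It assumes a table stored as a list of equal-length row strings, batches given as tuples of 0-based column indices, and zlib as a stand-in for $C$; all names are hypothetical.

```python
import zlib

def column(table, j):
    # column j of the table, read top to bottom
    return "".join(row[j] for row in table)

def column_major(table, batch):
    # T_B^c: the columns of the batch catenated one after another
    return "".join(column(table, j) for j in batch)

def row_major(table, batch):
    # T_B^r: the rows of the sub-table T_B catenated one after another
    return "".join("".join(row[j] for j in batch) for row in table)

def cost(table, schedule, render):
    # sum over batches of the compressed size of the chosen rendering
    return sum(len(zlib.compress(render(table, b).encode())) for b in schedule)

table = ["abc",
         "def"]
print(column_major(table, (0, 2)), row_major(table, (0, 2)))
```

In the column-major rendering each column survives as a contiguous substring, while the row-major rendering interleaves the columns of the batch, which is the distinction the text goes on to emphasize.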



In either column- or row-major order, batches of $T$ are subsets of columns. In column-major order,
each column of $T$ remains a distinct substring in any schedule. In row-major order, however, the individual
strings that form a schedule are the row-major renderings of batches of $T$. This distinction is subtle yet
crucial. Problem 5.2 becomes equivalent to Problem 5.1, so we may consider the latter in order to establish
lower bounds for the former. Problem 5.3, however, is not identical to Problem 5.1: the row-major order
rendering of the batches results in input strings being intermixed. We emphasize this distinction in Section
7, where we show that, when $C$ is run length encoding, Problem 5.1 can be solved in polynomial time, while
Problem 5.3 is MAX-SNP hard. The connection between table compression and SCS through Problem 5.1
makes these problems theoretically elegant as well as practically motivated.



6 Complexity with LZ77 



We use the standard definitions of L-reduction and MAX-SNP [18]. Let $A$ and $B$ be two optimization
(minimization or maximization) problems. Let $cost_A(y)$ be the cost of a solution $y$ to some instance of $A$;
let $opt_A(x)$ be the cost of an optimum solution for an instance $x$ of $A$; and define analogous metrics for $B$.
$A$ L-reduces to $B$ if there are two polynomial-time functions $f$ and $g$ and constants $\alpha, \beta > 0$ such that:

(1) Given an instance $a$ of $A$, $f(a)$ is an instance of $B$ such that $opt_B(f(a)) \le \alpha \cdot opt_A(a)$;

(2) Given a solution $y$ to $f(a)$, $g(y)$ is a solution to $a$ such that $|cost_A(g(y)) - opt_A(a)| \le \beta |cost_B(y) -
opt_B(f(a))|$.

The composition of two L-reductions is also an L-reduction. A problem is MAX-SNP hard [|l^] if 
every problem in MAX-SNP can be L-reduced to it. If A L-reduces to B, then if B has a polynomial-time 
approximation scheme (PTAS), so does A. A MAX-SNP hard problem is unlikely to have a PTAS 



Now recall the LZ77 parsing rule [22], which is used by compressors like gzip. Consider a string $z$, and,
if $|z| \ge 1$, let $z^-$ denote the prefix of $z$ of length $|z| - 1$. If $|z| \ge 2$, then define $z^{=} = (z^-)^-$.

LZ77 parses $z$ into phrases, each a substring of $z$. Assume that LZ77 has already parsed the prefix
$z_1 \cdots z_{i-1}$ of $z$ into phrases $z_1, \ldots, z_{i-1}$, and let $z'$ be the remaining suffix of $z$. LZ77 selects the $i$'th
phrase $z_i$ as the longest prefix of $z'$ that can be obtained by adding a single character to a substring of
$(z_1 \cdots z_{i-1} z_i)^-$. Therefore, $z_i$ has the property that $z_i^-$ is a substring of $(z_1 z_2 \cdots z_{i-1} z_i)^{=}$, but $z_i$ is not a
substring of $(z_1 z_2 \cdots z_{i-1} z_i)^-$. This recursive definition is sound [15].

After parsing $z_i$, LZ77 outputs an encoding of the triplet $(p_i, \ell_i, \alpha_i)$, where $p_i$ is the starting position of
$z_i^-$ in $z_1 z_2 \cdots z_{i-1}$; $\ell_i = |z_i| - 1$; and $\alpha_i$ is the last character of $z_i$. The length of the encoding is linear in
the number of phrases, so when $C$ is LZ77, we denote by $|C(z)|$ the number of phrases in the parsing of $z$.
This cost function is commonly used to establish the performance of LZ77 parsing [§, 15].
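
The parsing rule can be sketched in Python as below. This is a quadratic-time illustration rather than how gzip works internally. It reads the rule as "the copied part must start strictly before the current phrase, and may run into the phrase itself". It parses exactly the string it is given, so a caller who wants the implicit end-of-string character has to append a unique terminator first. The function name is hypothetical.

```python
def lz77_phrases(z):
    # Parse z into LZ77 phrases: longest match starting earlier, plus one literal symbol.
    phrases, i, n = [], 0, len(z)
    while i < n:
        best = 0
        for p in range(i):  # candidate source positions, strictly before i
            l = 0
            # the source may overlap the phrase being built (self-referential copy);
            # stop one symbol early so that a literal symbol always remains
            while i + l < n - 1 and z[p + l] == z[i + l]:
                l += 1
            best = max(best, l)
        phrases.append(z[i:i + best + 1])
        i += best + 1
    return phrases

print(lz77_phrases("abab"))
print(lz77_phrases("aaaabbbbaaaa"))
```

The number of phrases, `len(lz77_phrases(z))`, is the cost $|C(z)|$ used in the rest of this section.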



6.1 Problem 5.1 



We show that Problem 5.1 is MAX-SNP hard when $C$ is LZ77. Consider TSP(1,2), the traveling salesman
problem on a complete graph where each distance is either 1 or 2. An instance of TSP(1,2) can be specified
by a graph $H$, where the edges of $H$ connect those pairs of vertices with distance 1. The problem remains
MAX-SNP hard if we further restrict the problem so that the degree of each vertex in $H$ is bounded by some
arbitrary but fixed constant [19]. This result holds for both symmetric and asymmetric TSP(1,2); i.e., for
both undirected and directed graphs $H$. We assume that $H$ is directed. The following lemma shows that we
may also assume that no vertex in $H$ has outdegree 1.

Lemma 6.1 TSP(1,2) L-reduces to TSP(1,2) with the additional stipulation that no vertex has only one
outgoing cost-1 edge.

Proof. Consider instance $A$ of TSP(1,2). For each vertex $v$ with only one outgoing cost-1 edge, to some
$v'$, we create a new vertex $v''$ such that edges $(v, v'')$, $(v'', v)$, and $(v'', v')$ have cost 1 and all other edges
incident on $v''$ have cost 2. Thus we form instance $B$. A solution to $B$ is mapped to a solution to $A$ by
splicing out all newly created vertices. If $A$ has $n$ vertices, then $B$ has at most $2n$ vertices. All solutions to
both have cost $O(n)$, so we need only prove that the reverse mapping of solutions preserves optimality.

Assume $S_B$ is an optimal solution to $B$ and $S_A$ the corresponding mapped solution to $A$. Note that
$cost(S_A) \le cost(S_B) - \eta$, where $\eta$ is the number of vertices created to form $B$. (We drop the subscripts to
$cost(\cdot)$, as there is no ambiguity.) If $S_A$ is not optimal, there is some $S'_A$ such that $cost(S'_A) < cost(S_A)$.
We can form a solution $S'_B$ to $B$ by replacing each edge $(v, z)$ in $S'_A$, where $v$ has only one cost-1 outgoing
edge, with edges $(v, v'')$ and $(v'', z)$. This gives $cost(S'_B) = cost(S'_A) + \eta < cost(S_A) + \eta \le cost(S_B)$,
contradicting the optimality of $S_B$. □
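
The construction in the proof can be sketched in Python. The sketch assumes that $H$ is given as a dictionary mapping each vertex to the set of heads of its cost-1 out-edges, that there are no self-loops, and that the new edges are $(v, v'')$, $(v'', v)$ and $(v'', v')$; the representation and the function name are hypothetical.

```python
def remove_outdegree_one(H):
    # H maps each vertex to the set of heads of its cost-1 out-edges.
    B = {v: set(ws) for v, ws in H.items()}
    for v, ws in H.items():
        if len(ws) == 1:
            (v_prime,) = ws              # the single cost-1 successor v'
            v_new = ("new", v)           # the fresh vertex v''
            B[v].add(v_new)              # cost-1 edge (v, v'')
            B[v_new] = {v, v_prime}      # cost-1 edges (v'', v) and (v'', v')
    return B

H = {"a": {"b"}, "b": {"a", "c"}, "c": set()}
B = remove_outdegree_one(H)
print(B)
```

Every pair not listed in the dictionary is a cost-2 edge, so the new instance has at most twice as many vertices and no vertex with exactly one outgoing cost-1 edge.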

We associate a set $S(H)$ of strings to the vertices and edges of $H$; $S(H)$ will be the input to Problem
5.1. Each vertex $v$ engenders three symbols: $v$, $v'$, and $\$_v$. Let $w_0, \ldots, w_{d-1}$ be the vertices on the edges
out of $v$ in $H$, in some arbitrary but fixed cyclic order. For $0 \le i < d$ and mod-$d$ arithmetic, we say that
edge $(v, w_i)$ cyclicly precedes edge $(v, w_{i+1})$. The $d + 1$ strings we associate to $v$ and these edges are:
$e(v, w_i) = (v'w_{i-1})^4 v'w_i$, for $0 \le i < d$ and mod-$d$ arithmetic; and $s(v) = v^4 (v')^5 \$_v$. That $d \ne 1$ implies
that $w_i \ne w_{i+1}$ when $d \ne 0$.
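
The string set $S(H)$ can be built with a short Python sketch. It assumes the forms $e(v, w_i) = (v'w_{i-1})^4 v'w_i$ and $s(v) = v^4 (v')^5 \$_v$; the exponents are garbled in this copy and are inferred from the phrase counts used in the lemmas that follow. Symbols are represented as string tokens inside tuples, and $H$ as a dictionary from a vertex name to the list of its out-neighbors in cyclic order; these choices are hypothetical.

```python
def s(v):
    # s(v) = v^4 (v')^5 $_v, as a tuple of symbol tokens
    return (v,) * 4 + (v + "'",) * 5 + ("$" + v,)

def e(v, ws, i):
    # e(v, w_i) = (v' w_{i-1})^4 v' w_i, with indices taken mod d
    d = len(ws)
    return (v + "'", ws[(i - 1) % d]) * 4 + (v + "'", ws[i])

def strings_for(H):
    # S(H): one string per vertex and one per cost-1 edge
    out = {}
    for v, ws in H.items():
        out[("s", v)] = s(v)
        for i in range(len(ws)):
            out[("e", v, ws[i])] = e(v, ws, i)
    return out

H = {"a": ["b", "c"], "b": ["c", "a"], "c": []}
S = strings_for(H)
print(len(S), S[("e", "a", "b")])
```

Each edge string starts with repetitions that mention the cyclic predecessor, so it can be copied in one phrase only when that predecessor comes immediately before it.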

To prove MAX-SNP hardness, we first show how to transform a TSP(1,2) solution for $H$ into a solution
to Problem 5.1 with input $S(H)$. We then show how to transform in polynomial time a solution to Problem
5.1 into a TSP(1,2) solution of a certain cost. We use the intermediate step of transforming the first solution
into a canonical form of at most the same cost.

The canonical form solution will correspond to the required TSP(1,2) tour. We will show that, for all
edges $(v, w)$, $e(v, w)$ will parse into one phrase when immediately preceded by $e(v, y)$ for the edge $(v, y)$
that cyclicly precedes $(v, w)$, and into more than one phrase otherwise; and we will show that $s(v)$ will parse
into two phrases when immediately preceded by $e(x, v)$ for some edge $(x, v)$, and into three phrases otherwise.
Thus, an edge $(v, w_i)$ in the path will best be encoded as $s(v)e(v, w_{i+1})e(v, w_{i+2}) \cdots e(v, w_i)s(w_i)$.
This is the core idea of our canonical form.

Lemmas 6.2-6.4 provide a few needed facts about the parsing of strings in $S(H)$. In what follows, $X$
denotes both a batch in $S(H)$ and the string obtained by catenating the strings in the batch in order.

Lemma 6.2 Let $X = x_1 \cdots x_s$ be a batch of $S(H)$, where each $x_i$ is $s(v)$ for some vertex $v$ or $e(v, w)$ for
some edge $(v, w)$. For each $1 \le j \le s$, some phrase in the LZ77 parsing of $X$ ends at the last symbol of $x_j$.

Proof. The proof is by induction. The base case is for $j = 1$. If $x_1 = s(v)$ for some vertex $v$, then the
lemma holds, because $\$_v$ appears only at the end of $s(v)$. Otherwise, $x_1 = e(v, w_i)$ for some edge $(v, w_i)$.
Since $x_1$ appears first in $X$, its parsing is $v'$, $w_{i-1}$, $(v'w_{i-1})^3 v'w_i$. (That no vertex has outdegree one implies
that $w_i \ne w_{i-1}$.) The lemma again holds.

Assume by induction that the lemma is true up through the parsing of $x_{j-1}$; we show that it holds for
the parsing through $x_j$. Again, if $x_j = s(v)$ for some vertex $v$, the lemma is true, because $\$_v$ appears only
at the end of $s(v)$. Otherwise, $x_j = e(v, w_i)$, for some edge $(v, w_i)$. There are two cases.

1. $x_{j-1} = e(v, w_{i-1})$. Then $x_{j-1}x_j = (v'w_{i-2})^4 (v'w_{i-1})^5 v'w_i$. By induction, a phrase ends at the first
occurrence of $w_{i-1}$. Thus, the next phrase is $(v'w_{i-1})^4 v'w_i = x_j$.

2. $x_{j-1} \ne e(v, w_{i-1})$. Again by induction, the first phrase, say $c$, of the parsing that overlaps $x_j$ must
start at the first character of $x_j$. Since $(v'w_{i-1})^2$ does not occur in $x_1 \cdots x_{j-1}$, the first phrase cannot
extend past the fourth character of $x_j$. We have the following subcases.

(a) $c$ ends at the first character of $x_j$. Therefore $v'$ does not occur in $x_1 \cdots x_{j-1}$. Since $x_j =
(v'w_{i-1})^4 v'w_i$, we have that the phrase following $c$, say $c'$, must be either $w_{i-1}$ or $w_{i-1}v'$,
depending on whether or not $w_{i-1}$ occurs in $x_1 \cdots x_{j-1}$. (1) When $c'$ is $w_{i-1}$, the next phrase is
$(v'w_{i-1})^3 v'w_i$ and ends on the last character of $x_j$, as required. (2) When $c'$ is $w_{i-1}v'$, the next
phrase is $(w_{i-1}v')^3 w_i$, again completing the induction.

(b) Remaining cases. When $c$ ends at the second and third character of $x_j$, the result follows
as in (a.1) and (a.2), respectively. When $c$ ends at the fourth character, the next phrase is
$(v'w_{i-1})^2 v'w_i$ and ends at the last character of $x_j$ as required.

□



Lemma 6.3 Let $X$ be a batch of $S(H)$ and $v$ be any vertex such that $s(v) \in X$. If $s(v)$ is immediately
preceded by $e(q, v)$ for some edge $(q, v)$, $s(v)$ is parsed into precisely two phrases during the parsing of $X$;
otherwise, $s(v)$ is parsed into precisely three phrases.

Proof. Assume first that $s(v)$ is immediately preceded by $e(q, v)$ for some edge $(q, v)$. Then $e(q, v)s(v) =
(q'z)^4 q'v \, v^4 (v')^5 \$_v$ for some $z$. By Lemma 6.2, a phrase of the parsing must end with the last character of
$e(q, v)$. Since $\$_v$ does not appear elsewhere in $X$, the next two phrases of the parsing must be $v^4 v'$ and
$(v')^4 \$_v$.

In the other case, $v$ does not occur immediately to the left of $s(v)$ in $X$. Again using Lemma 6.2, the parsing of $X$
has a phrase starting at $s(v)$. If $v$ appears to the left of $s(v)$ in $X$, the parsing produces $vv$, $v^2 v'$, and $(v')^4 \$_v$.
Otherwise, it produces $v$, $v^3 v'$, and $(v')^4 \$_v$. □

Lemma 6.4 Let $X$ be a batch of $S(H)$ and $(v, w)$ be any edge such that $e(v, w) \in X$. Let $(v, y)$ be the
edge that cyclicly precedes $(v, w)$. If $e(v, w)$ is immediately preceded in $X$ by $e(v, y)$, then $e(v, w)$ is parsed
into precisely one phrase during the parsing of $X$; if $e(v, w)$ is immediately preceded by $s(v)$, then $e(v, w)$
is parsed into precisely two phrases; in any other case, $e(v, w)$ is parsed into at least two phrases.

Proof. By Lemma 6.2, some phrase starts at the first character of $e(v, w)$. Assume $e(v, y)$ immediately
precedes $e(v, w)$; $e(v, y)e(v, w) = (v'z)^4 v'y (v'y)^4 v'w$ for some $z$. The parsing of $e(v, w)$ produces the one
phrase $(v'y)^4 v'w = e(v, w)$. (Nowhere else does this string appear in $X$.)

Assume $s(v)$ immediately precedes $e(v, w)$; $s(v)e(v, w) = v^4 (v')^5 \$_v (v'y)^4 v'w$. If $v'y$ occurs earlier
in $X$, the parsing of $e(v, w)$ produces phrases $v'yv'$ and $(yv')^3 w$, because $v'yv'$ cannot occur elsewhere.
Otherwise, the parsing produces $v'y$ and $(v'y)^3 v'w$.

In any other case, $e(v, w)$ is preceded by a character other than $v'$. If $v'$ occurs earlier in $X$, then the
parsing of $e(v, w)$ produces two phrases as in the case of $s(v)$ preceding $e(v, w)$. Otherwise, the parsing
produces $v'$ and then at least one more phrase. □

Now define a schedule $Y_1, \ldots, Y_t$ to be standard if and only if: for each batch $Y_i$, the order in which the
strings $s(v)$ appear in $Y_i$ corresponds to a path in $H$; the paths associated to $Y_i$ and $Y_j$ are disjoint for each
$i \ne j$; and each vertex of $H$ appears as $s(v)$ in some batch $Y_i$.

We give a polynomial time algorithm that transforms a schedule $S = (X_1, \ldots, X_s)$ into a standard
schedule that parses into no more phrases than does $S$. The algorithm consists of two phases. The first
phase computes a set of disjoint paths that covers all the vertices of $H$. It iteratively combines paths, guided
by $S$, until no further combination is possible. The second phase transforms each path into a batch such that
the resulting schedule is standard.

Algorithm STANDARD

P1 1. Place each vertex $v$ of $H$ in a single-vertex path. If $s(v)$ is the first string in some batch in $S$,
label $v$ terminal; otherwise, label $v$ nonterminal.

2. While there exists a path with nonterminal left end point, pick one such end point $v$ and process
it as follows. Let $X_i$ be the batch in which $s(v)$ occurs. Let $x(u)$ be the string (associated
to either vertex $u$ or to one of its outgoing edges) that precedes $s(v)$ in $X_i$. If $x(u)$ ends in a
symbol other than $v$, label vertex $v$ terminal. Otherwise, $(u, v) \in H$, so connect $u$ to $v$, and, for
each edge $(u, w) \in H$, $w \ne v$, such that $s(w)$ is immediately preceded by $e(u, w)$, declare $w$
terminal. (This guarantees that Phase One actually builds paths.)

P2 Let $A_1, \ldots, A_t$ be the paths obtained at the end of Phase One. We transform each path $A_j$ into a
batch $Y_j$. If $A_j$ consists of a single vertex $v$, then $Y_j$ consists of $s(v)$ followed by all the $e(v, w_j)$'s
arranged in cyclic order.

Otherwise, $A_j$ contains more than one vertex. Initially $Y_j$ is empty. For each edge $(u, v)$ in order
in the path, we append to $Y_j$: $s(u)$ followed by all of its $e(u, w_j)$'s, in cyclic order ending with
$e(u, v)$. When there are no more edges to process, the last vertex of the path is processed as in the
singleton-vertex case.



Lemma 6.5 In polynomial time, Algorithm STANDARD transforms schedule $X_1, \ldots, X_s$ into a standard
schedule $Y_1, \ldots, Y_t$ of no higher cost.

Proof. That Algorithm STANDARD runs in polynomial time and $Y_1, \ldots, Y_t$ is standard follow immediately
from the specification.

We now show that each batch $Y_j$ parses into no more phrases than do its corresponding components in
the input schedule. Consider the path $(v_1, v_2, \ldots, v_r)$ from which $Y_j$ is derived. Let $d(v)$ be the outdegree
of any vertex $v$. $Y_j = s(v_1)e(v_1, w^1_1) \cdots e(v_1, w^1_{d(v_1)}) \cdots s(v_r)e(v_r, w^r_1) \cdots e(v_r, w^r_{d(v_r)})$, where the $w^i_j$'s
are the neighbors in cyclic order out of $v_i$ and, for $1 \le i < r$, we assume without loss of generality that
$w^i_{d(v_i)} = v_{i+1}$.

By Lemma 6.3, for $2 \le i \le r$, $s(v_i)$ parses into two phrases, which is optimal. By Lemma 6.4, for
$1 \le i \le r$ and $2 \le j \le d(v_i)$, $e(v_i, w^i_j)$ parses into one phrase, which is optimal. We thus need only
consider the parsing of $s(v_1)$ and, for $1 \le i \le r$, $e(v_i, w^i_1)$.

The strings $e(v_i, w^i_1)$, for $1 \le i \le r$, each parse into two phrases in $Y_j$, by Lemma 6.4. There must
be some $e(v_i, x)$ that is not immediately preceded by its cyclic predecessor in some $X_\ell$, and this instance
of $e(v_i, x)$ also parses into at least two phrases, by Lemma 6.4. This accounts for the first $e(\cdot)$ string
immediately following each $s(\cdot)$ string in $Y_j$.

Finally, if $s(v_1)$ is not immediately preceded by some $e(v, v_1)$ in the input batch $X_\ell$ in which $s(v_1)$
appears, we are done, for $s(v_1)$ is parsed into three phrases in both $X_\ell$ and $Y_j$, by Lemma 6.3. Otherwise,
consider the maximal sequence $e(v, w_a)e(v, w_{a+1}) \cdots e(v, w_{a+i} = v_1)s(v_1)$ in $X_\ell$, where the $w_a, w_{a+1}, \ldots$ are
cyclicly ordered neighbors of $v$. Because STANDARD declared $v_1$ to be terminal, there was another edge
$(v, y)$ such that $e(v, y)$ immediately preceded $s(y)$ in some $X_{\ell'}$, which STANDARD used to connect $v$ and
$y$ in some path. This engenders an analogous maximal chain of $e(v, \cdot)$ strings followed by $s(y)$ in $X_{\ell'}$.
Thus, there are at least two strings $e(v, \cdot)$ not immediately preceded in the input by their cyclic predecessors;
Lemma 6.4 implies each is parsed into at least two phrases. We can charge the extra phrase generated
by $s(v_1)$ in $Y_j$ against one of them, leaving the other for the extra phrase in the parsing of the $e(v, \cdot)$ phrase
immediately following $s(v)$ in some $Y_{j'}$. □

Lemma 6.6 A batch $Y_j$ output by STANDARD, corresponding to a path $(v_1, \ldots, v_r)$, parses into $3r + 1 +
\sum_{i=1}^{r} d(v_i)$ phrases.

Proof. By Lemma 6.3, each $s(\cdot)$ string parses into 2 phrases, except $s(v_1)$, which parses into 3, contributing
$2r + 1$ phrases. Lemma 6.4 implies that each $e(\cdot)$ parses into 1 phrase, except each following an $s(\cdot)$, which
parses into 2, contributing $r + \sum_{i=1}^{r} d(v_i)$ phrases. □
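
The count in Lemma 6.6 can be checked on a small example with the following Python sketch. It assumes the string forms $e(v, w_i) = (v'w_{i-1})^4 v'w_i$ and $s(v) = v^4 (v')^5 \$_v$, with exponents inferred from the lemmas because they are garbled in this copy. It uses a quadratic-time LZ77 phrase counter in which a copy must start before the current phrase, and a hypothetical representation of $H$ as a dictionary from vertex names to out-neighbors in cyclic order. The batch builder follows Phase Two of STANDARD for a single path.

```python
def lz77_count(z):
    # number of LZ77 phrases: longest match starting earlier, plus one literal symbol
    i, n, count = 0, len(z), 0
    while i < n:
        best = 0
        for p in range(i):
            l = 0
            while i + l < n - 1 and z[p + l] == z[i + l]:
                l += 1
            best = max(best, l)
        i += best + 1
        count += 1
    return count

def s(v):
    return (v,) * 4 + (v + "'",) * 5 + ("$" + v,)

def e(v, ws, i):
    d = len(ws)
    return (v + "'", ws[(i - 1) % d]) * 4 + (v + "'", ws[i])

def batch_for_path(H, path):
    # s(u), then all e(u, .) in cyclic order ending with the edge to the next path vertex
    Y = ()
    for k, u in enumerate(path):
        ws = H[u]
        d = len(ws)
        Y += s(u)
        if d == 0:
            continue
        last = ws.index(path[k + 1]) if k + 1 < len(path) else d - 1
        for t in range(1, d + 1):
            Y += e(u, ws, (last + t) % d)
    return Y

H = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
path = ["a", "b", "c"]
r = len(path)
predicted = 3 * r + 1 + sum(len(H[v]) for v in path)
print(lz77_count(batch_for_path(H, path)), predicted)
```

On this three-vertex graph the path has $r = 3$ and total outdegree 6, so the lemma predicts 16 phrases, and the parser agrees.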



Theorem 6.7 Problem 5.1 is MAX-SNP hard when C is LZ77. 



Proof. Let the graph $H$ defined at the beginning of the section have $n_h$ vertices and $m_h$ edges. Let $k$ be the
minimum number of cost-2 edges that suffice to form a TSP(1,2) solution. Then the cost of the solution is
$n_h - 1 + k$. Associating strings to vertices and edges of $H$, as discussed above, we argue that the optimal
schedule for those strings produces $m_h + k + 3n_h + 1$ phrases. The reduction is linear, since $m_h = O(n_h)$
by the assumption of bounded outdegree.

Assume that the TSP(1,2) solution with $k$ cost-2 edges is the path $v_1, v_2, \ldots, v_{n_h}$. Then in polynomial
time we can construct a corresponding standard schedule of the form output by STANDARD, which Lemma
6.6 shows parses into $m_h + k + 3n_h + 1$ phrases.

For the converse, assume that we are given a schedule of cost $m_h + k + 3n_h + 1$. By Lemma 6.5, we can
transform it in polynomial time into a standard schedule $Y_1, \ldots, Y_t$ of no higher cost. Recall that to each
batch we can associate a path of $H$. Let $v_1, v_2, \ldots, v_{n_h}$ be an ordering of the vertices of $H$ corresponding
to an arbitrarily chosen processing order for the sequence of batches. Then, $H$ cannot be missing more
than $k$ of the edges $(v_i, v_{i+1})$, or else, by Lemma 6.6, the cost of the standard schedule would exceed
$m_h + k + 3n_h + 1$. □



7 Complexity with Run Length Encoding 

In run length encoding (RLE), an input string is parsed into phrases of the form $(a, n)$, where $a$ is a character,
and $n$ is the number of times $a$ appears consecutively. For example, aaaabbbbaaaa is parsed into
(a,4)(b,4)(a,4).
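
The RLE parsing just described can be written in a few lines of Python. The sketch below uses `itertools.groupby` from the standard library; the function name `rle` is hypothetical.

```python
from itertools import groupby

def rle(s):
    # one (character, run length) phrase per maximal run of equal characters
    return [(ch, len(list(run))) for ch, run in groupby(s)]

print(rle("aaaabbbbaaaa"))
```

The cost $|C(x)|$ under RLE is then simply `len(rle(x))`, the number of runs.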



7.1 Problem 5.1 



Theorem 7.1 Problem 5.1 can be solved in polynomial time when C is run length encoding. 



Proof. Let $x_1, \ldots, x_n$ be the input strings. We can assume without loss of generality that each $x_i$ is of
the form $aa'$; i.e., two distinct characters. The parsing of any characters between them cannot be optimized
by rearranging the strings. Furthermore, if $x_i = aa$, we can simply merge $x_i$ with another string, $x_j$, that
begins or ends with $a$; if no such $x_j$ exists, we can ignore $x_i$ completely, since again its parsing cannot be
optimized by rearrangement.

We claim that a shortest common superstring (SCS) of the input corresponds to an optimal schedule.
As described earlier, an SCS is $pref(x_{\pi_1}, x_{\pi_2}) \cdots pref(x_{\pi_{n-1}}, x_{\pi_n}) \, x_{\pi_n}$ for some permutation $\pi$. Note that
$pref(x_i, x_j)$ is of length 2 if the last character of $x_i$ equals the first of $x_j$ and 3 otherwise. Thus, an SCS gives
an optimal RLE parsing, and SCS can be solved in polynomial time when all input strings are of length two

Ml ° 
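
The effect of ordering on the RLE cost can be seen with a brute-force Python sketch. It is an exhaustive search over orderings of a single batch, exponential in the number of strings, and it is not the polynomial-time method of the proof. It relies on the observation that catenating two strings never produces more runs than parsing them separately, so one batch suffices for RLE. The function names are hypothetical.

```python
from itertools import groupby, permutations

def rle_cost(s):
    # number of RLE phrases, i.e. maximal runs of equal characters
    return sum(1 for _ in groupby(s))

def best_single_batch(strings):
    # try every ordering of the strings in one batch and keep a cheapest one
    return min(permutations(strings), key=lambda order: rle_cost("".join(order)))

strings = ["ab", "bc", "ca"]
best = best_single_batch(strings)
print(best, rle_cost("".join(best)))
```

Orderings that chain strings whose last and first characters agree merge runs at the seams, which is exactly what the overlap in an SCS captures for strings of length two.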



7.2 Problem 5.3 



As in Section 6.1, we transform the vertices and edges of $H$ into an instance of Problem 5.3. We associate
a column to each vertex and edge of $H$.

For each vertex $v$, we generate three symbols: $v$, $v'$, and $v''$. Let $w_0, \ldots, w_{d-1}$ be the vertices on the
edges out of $v$ in some fixed, arbitrary cyclic order. We associate the following strings to $v$ and its outgoing
edges: $s(v) = v'v''v$; and $e(v, w_i) = v'v''w_i$, $0 \le i < d$. The input table is formed by assigning each such
string, over all the vertices, to a column.

Consider a TSP(1,2) solution with k cost-2 edges. We can arrange the induced strings into a table T 
describing these paths. Place all strings corresponding to a vertex $v$ in a contiguous interval of the table with
s{v) being the first column of the interval. For any edge {v, q) in the collection of paths, place the interval 
corresponding to q immediately after that corresponding to v, and place the string e{v, q) last in the interval 
for v; otherwise, the order of the intervals and of the strings corresponding to edges can be arbitrary. We say 
the table is in standard form for the collection of paths. 

Theorem 7.2 Problem 5.3 is MAX-SNP hard for tables of at least 3 rows when $C$ is run length encoding.

Proof. We prove the theorem for three rows first and then extend it to larger numbers of rows. Let $n_h$ and
$m_h$ be the number of vertices and edges in $H$, respectively, and let $n = n_h + m_h$ be the number of columns in the
induced table. Associate strings to the vertices and edges as described above. Let $k$ be the minimum number
of cost-2 edges that suffice to form a TSP(1,2) solution for $H$. Then the cost of the solution is $n_h - 1 + k$. Let
$v_1, v_2, \ldots, v_{n_h}$ be an ordering of the vertices in $H$ corresponding to the $k + 1$ disjoint paths. Let $T$ be the
corresponding standard form table. Let $S$ be the schedule obtained by taking as a batch each interval of the
table corresponding to a path. The row-major cost of $S$ is $2n + m_h + k + 1$. This completes one direction
of the transformation.

As for the other direction, assume that we are given a solution to the instance of optimum table compression
that has cost $2n + m_h + k + 1$. Let $T'$ be the table of the solution schedule. In polynomial time, we
can transform $T'$ into a standard form table $T$ with a schedule of at most the same cost. We simply observe
that, if the $e(\cdot)$ and $s(\cdot)$ strings for any vertex are not contiguous, we can rearrange the columns to make
them so, saving at least two phrases and generating at most two in the new parsing.

Since a table in standard form corresponds to an ordering of the vertices, it must be that $H$ cannot be
missing more than $k$ edges, or else the cost of the table in standard form would be greater than $2n + m_h +
k + 1$.

When the number of rows $m$ exceeds three, we use one additional character $\$$. Each string is as in the
case $m = 3$, except that now it is augmented to end with the suffix $\$^{m-3}$. This would add one more phrase to
the parsing of the set of strings, and the linearity of the transformation still holds. □

8 Conclusion 

We demonstrate a general framework that links independence among groups of variables to efficient par- 
titioning algorithms. We provide general solutions in ideal cases in which dependencies form equivalence 
classes or cost functions are sub-additive. The application to table compression suggests that there also exist 
weaker structures that allow partitioning to produce significant cost improvements. Open is the problem of 
refining the theory to explain these structures and extending it to other applications. 

Based on experimental results, we conjecture that our TSP reordering algorithm is close to optimal; i.e., 
that no partition-based algorithm will produce significantly better compression rates. It is open if there exists 
a measurable lower bound for compression optimality, analogous, e.g., to the Held-Karp TSP lower bound. 

Finally, while we have shown some MAX-SNP hardness results pertaining to table compression, it is 
open whether the problem is even approximable to within constant factors. 